library(ggplot2)
library(dplyr)
library(purrr)
library(gridExtra)
library(scales)
library(tidyr)
library(GGally)
library(reshape2)
library(memisc)
library(RColorBrewer)

Let’s load the dataset and take a look at the data.

Dataset introduction

This data was scraped from the IMDb website and published on Kaggle by user chuansun76 about a year ago.

It contains information (both basic info provided by IMDb and complementary data from other sources) about some 5000 movies released over the last 100 years in 65 countries of the world. Each observation (movie) is described by 28 variables such as director name, movie genre, budget etc.

Let’s check out the dimensions of the dataset:

## [1] 5043   28

The dataset has 5043 observations across 28 variables.

Let’s take a closer look at our variables:

## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 927 801 2027 377 603 106 2030 1652 1228 551 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 534 2432 2549 1227 801 2439 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 302 979 353 1968 526 440 785 221 336 32 ...
##  $ movie_title              : Factor w/ 4917 levels "[Rec] ","[Rec] 2 ",..: 398 2731 3279 3708 3332 1961 3291 3459 399 1631 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1392 3134 1769 1 2714 1969 2162 3018 2941 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

We have 12 factor and 16 numeric variables. All variable types look appropriate and do not require changes.

The variables of interest can be divided into three groups: movie production characteristics (title year, duration, genre), popularity characteristics (number of voted users, number of user reviews, movie facebook likes) and financial characteristics (budget and gross).

Now let’s perform a basic sanity check of the data:

##               color               director_name  num_critic_for_reviews
##                  :  19                   : 104   Min.   :  1.0         
##   Black and White: 209   Steven Spielberg:  26   1st Qu.: 50.0         
##  Color           :4815   Woody Allen     :  22   Median :110.0         
##                          Clint Eastwood  :  20   Mean   :140.2         
##                          Martin Scorsese :  20   3rd Qu.:195.0         
##                          Ridley Scott    :  17   Max.   :813.0         
##                          (Other)         :4834   NA's   :50            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  133.0       
##  Median :103.0   Median :   49.0         Median :  371.5       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  194.5         3rd Qu.:  636.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :104             NA's   :23            
##           actor_2_name  actor_1_facebook_likes     gross          
##  Morgan Freeman :  20   Min.   :     0         Min.   :      162  
##  Charlize Theron:  15   1st Qu.:   614         1st Qu.:  5340988  
##  Brad Pitt      :  14   Median :   988         Median : 25517500  
##                 :  13   Mean   :  6560         Mean   : 48468408  
##  James Franco   :  11   3rd Qu.: 11000         3rd Qu.: 62309438  
##  Meryl Streep   :  11   Max.   :640000         Max.   :760505847  
##  (Other)        :4959   NA's   :7              NA's   :884        
##                   genres                actor_1_name 
##  Drama               : 236   Robert De Niro   :  49  
##  Comedy              : 209   Johnny Depp      :  41  
##  Comedy|Drama        : 191   Nicolas Cage     :  33  
##  Comedy|Drama|Romance: 187   J.K. Simmons     :  31  
##  Comedy|Romance      : 158   Bruce Willis     :  30  
##  Drama|Romance       : 152   Denzel Washington:  30  
##  (Other)             :3910   (Other)          :4829  
##                     movie_title   num_voted_users  
##  Ben-Hur                  :   3   Min.   :      5  
##  Halloween                :   3   1st Qu.:   8594  
##  Home                     :   3   Median :  34359  
##  King Kong                :   3   Mean   :  83668  
##  Pan                      :   3   3rd Qu.:  96309  
##  The Fast and the Furious :   3   Max.   :1689764  
##  (Other)                  :5025                    
##  cast_total_facebook_likes         actor_3_name  facenumber_in_poster
##  Min.   :     0                          :  23   Min.   : 0.000      
##  1st Qu.:  1411            Ben Mendelsohn:   8   1st Qu.: 0.000      
##  Median :  3090            John Heard    :   8   Median : 1.000      
##  Mean   :  9699            Steve Coogan  :   8   Mean   : 1.371      
##  3rd Qu.: 13756            Anne Hathaway :   7   3rd Qu.: 2.000      
##  Max.   :656730            Jon Gries     :   7   Max.   :43.000      
##                            (Other)       :4982   NA's   :13          
##                                                                            plot_keywords 
##                                                                                   : 153  
##  based on novel                                                                   :   4  
##  1940s|child hero|fantasy world|orphan|reference to peter pan                     :   3  
##  alien friendship|alien invasion|australia|flying car|mother daughter relationship:   3  
##  animal name in title|ape abducts a woman|gorilla|island|king kong                :   3  
##  assistant|experiment|frankenstein|medical student|scientist                      :   3  
##  (Other)                                                                          :4874  
##                                              movie_imdb_link
##  http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1:   3  
##  (Other)                                             :5025  
##  num_user_for_reviews     language         country       content_rating
##  Min.   :   1.0       English :4704   USA      :3807   R        :2118  
##  1st Qu.:  65.0       French  :  73   UK       : 448   PG-13    :1461  
##  Median : 156.0       Spanish :  40   France   : 154   PG       : 701  
##  Mean   : 272.8       Hindi   :  28   Canada   : 126            : 303  
##  3rd Qu.: 326.0       Mandarin:  26   Germany  :  97   Not Rated: 116  
##  Max.   :5060.0       German  :  19   Australia:  55   G        : 112  
##  NA's   :21           (Other) : 153   (Other)  : 356   (Other)  : 232  
##      budget            title_year   actor_2_facebook_likes   imdb_score   
##  Min.   :2.180e+02   Min.   :1916   Min.   :     0         Min.   :1.600  
##  1st Qu.:6.000e+06   1st Qu.:1999   1st Qu.:   281         1st Qu.:5.800  
##  Median :2.000e+07   Median :2005   Median :   595         Median :6.600  
##  Mean   :3.975e+07   Mean   :2002   Mean   :  1652         Mean   :6.442  
##  3rd Qu.:4.500e+07   3rd Qu.:2011   3rd Qu.:   918         3rd Qu.:7.200  
##  Max.   :1.222e+10   Max.   :2016   Max.   :137000         Max.   :9.500  
##  NA's   :492         NA's   :108    NA's   :13                            
##   aspect_ratio   movie_facebook_likes
##  Min.   : 1.18   Min.   :     0      
##  1st Qu.: 1.85   1st Qu.:     0      
##  Median : 2.35   Median :   166      
##  Mean   : 2.22   Mean   :  7526      
##  3rd Qu.: 2.35   3rd Qu.:  3000      
##  Max.   :16.00   Max.   :349000      
##  NA's   :329

For the categorical variables, the basic sanity check revealed many NAs and blank cells. Judging by the summary of the movie_title variable, where several titles appear three times, I suspect we also have some duplicate rows. In addition, the genres column contains several values in one cell. We will need to split those values in order to be able to analyse the data.
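The checks described above can be sketched on a small made-up data frame. The name `movies_toy` and all of its values are assumptions for illustration only; this is not the original chunk of the report.

```r
# Toy data frame standing in for the movies data (all values are made up).
movies_toy <- data.frame(
  movie_title = c("King Kong", "King Kong", "Home", "Pan"),
  color       = c("Color", "Color", "", NA),
  gross       = c(100, 100, NA, 50),
  stringsAsFactors = FALSE
)

# Number of NA values per column.
n_missing <- sapply(movies_toy, function(x) sum(is.na(x)))

# Number of blank (empty string) cells per column; NAs are not counted as blanks.
n_blank <- sapply(movies_toy, function(x) sum(!is.na(x) & as.character(x) == ""))

# Number of rows that are exact copies of an earlier row.
n_dup <- sum(duplicated(movies_toy))
```

Counting blanks separately from NAs matters here because `summary()` reports empty strings in factor columns as an unnamed level rather than as missing values.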

As for the numeric variables, my main concern is the budget column, with a maximum value of over 12 bln! We will need to double-check the currencies of the budget and gross values for all the movies.

Another potential problem with the financial variables is that the IMDb site shows the budget and gross numbers as announced in the year of the movie release. Here we need to be careful, as USD 100 in 2005 is not worth the same as USD 100 in 2016 due to inflation and other factors. So we will need to standardize both the gross and budget columns to the same base in case we want to include them in our analysis.

Now we build a correlation matrix and visualize it to check the correlation coefficients between the numeric variables and decide which variables to investigate in our analysis.

There is a considerable positive correlation between IMDb score and both movie duration and the number of voted users; there is also a noticeable negative correlation between IMDb score and title year. We will consider those variables in our analysis.
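A minimal sketch of how such a correlation matrix can be computed in base R is shown here. The three numeric columns reuse the first few values printed in the structure output above, while the `country` column and the name `movies_toy` are assumptions; the plotting step of the report is not reproduced.

```r
# Small sample of rows (numeric values taken from the str() output above;
# the country column is made up to show that non-numeric columns are skipped).
movies_toy <- data.frame(
  imdb_score = c(7.9, 7.1, 6.8, 8.5, 6.6),
  duration   = c(178, 169, 148, 164, 132),
  title_year = c(2009, 2007, 2015, 2012, 2012),
  country    = c("USA", "USA", "UK", "USA", "USA"),
  stringsAsFactors = FALSE
)

# Keep only numeric columns, then compute pairwise Pearson correlations,
# ignoring missing values pair by pair.
num_cols <- sapply(movies_toy, is.numeric)
cor_mat  <- cor(movies_toy[, num_cols], use = "pairwise.complete.obs")

# Correlations of every numeric variable with imdb_score.
round(cor_mat["imdb_score", ], 2)
```

With only five rows the coefficients are not meaningful; the point is the mechanics of selecting numeric columns and handling NAs.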

Posing question and defining features of interest

In this analysis we will investigate the factors that contribute to a movie’s IMDb score.

There are several main features of interest that we selected for our analysis based on correlation coefficients and intuition :)

Numerical features:

  • title_year

  • duration

  • num_voted_users

  • num_user_for_reviews

  • num_critic_for_reviews

  • movie_facebook_likes

  • budget

  • gross

Categorical features:

  • genres

Dataset Preparation and Cleaning

As noted above, we need to standardize the ‘gross’ and ‘budget’ variables to account for inflation across years so that we can compare and analyse them. We also need to convert all currencies to USD, since for many movies shot outside the US the budget is given in local currency (the enormous budget of 12 bln revealed by the sanity check is in Korean won, which is some 10 mln dollars).

By looking at the source code of the web scraping part for the dataset, we can see that the budget and gross data were scraped together with the currency description, so most likely the currency description was dropped while parsing the scraped data.

We made some minor adjustments to the dataset parsing source code to serve our purpose and created a new column for currency; we also converted all currency signs (e.g. ‘$’, ‘€’ etc.) to abbreviations (‘USD’, ‘EUR’ etc.).

After some cleaning of our updated dataset, let’s get right to standardizing the financial variables. First we summarise our data by currency and title year:

## Source: local data frame [10 x 3]
## Groups: budget_currency [3]
## 
##    budget_currency title_year     n
##             <fctr>      <int> <int>
## 1             AUD        1986     1
## 2             AUD        1988     1
## 3             AUD        1989     1
## 4             AUD        1997     1
## 5             AUD        2006     2
## 6             AUD        2009     1
## 7             BRL        2015     1
## 8             CAD        1994     1
## 9             CAD        1997     2
## 10            CAD        1999     1

Now we convert all budget values to USD, then standardize our values to a 2016 base, accounting for inflation across years.

To do so, we need to add the exchange rate for each currency in the given year. We obtain it by scraping this superhelpful website and writing the scraped data frame into a separate csv file, so that we don’t have to repeat the scraping every time we run the code. The web scraping part was performed in R and the script can be found in the exrates_scraping_script.Rmd file.

Then we create a column with the exchange rate in our original movies dataframe and convert the movie budgets to USD.

Let’s examine our data now. We inserted exchange rates for all currencies except USD, so let’s see whether all foreign currencies now have exchange rates.

##      director_name movie_title budget_currency title_year
## 2439    Fritz Lang Metropolis             DEM        1927

We have one row without an exchange rate; this is because the exchange rate for 1927 was not covered in our scraped data. We will insert the rate for this row manually, using historical records from this site.

For now we also set the exchange rate for USD as 1.

To convert budget values to USD we use the exchange rate provided for each currency; then we need to standardize all budget and gross values to account for inflation.
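The conversion step can be sketched as a join on currency and year followed by a multiplication. Everything in this example is an assumption for illustration: the toy data, the column name `exrate`, the rates themselves, and the convention that the rate is expressed as USD per one unit of local currency.

```r
# Toy movies table (made-up titles and budgets).
movies_toy <- data.frame(
  movie_title     = c("A", "B", "C"),
  budget          = c(1.0e7, 2.0e6, 5.0e6),
  budget_currency = c("USD", "AUD", "CAD"),
  title_year      = c(2009, 2006, 1997),
  stringsAsFactors = FALSE
)

# Toy exchange-rate table: USD per one unit of local currency (made-up rates).
exrates_toy <- data.frame(
  budget_currency = c("AUD", "CAD"),
  title_year      = c(2006, 1997),
  exrate          = c(0.75, 0.72),
  stringsAsFactors = FALSE
)

# Left join on currency and year; movies without a matching rate get NA.
merged <- merge(movies_toy, exrates_toy,
                by = c("budget_currency", "title_year"), all.x = TRUE)

# USD budgets need no conversion, so their rate is set to 1.
merged$exrate[merged$budget_currency == "USD"] <- 1

# Convert every budget to USD.
merged$budget_usd <- merged$budget * merged$exrate
```

Using a left join keeps movies with no matching rate in the table with an NA rate, which is how a gap like the 1927 row shows up and can then be filled by hand.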

To convert the budgets to 2016 year base, we use Consumer Price Index (CPI): Past dollars in terms of recent dollars = Dollar amount × Ending-period CPI ÷ Beginning-period CPI.

The US CPI for the period 1900 - 2016 was downloaded as csv file from here

The ending period is 2016, with a CPI equal to 240.01, while the beginning-period CPI for each budget value is stored in a new column, cpi.
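As a small worked example of the CPI formula, the sketch below uses the 2016 CPI of 240.01 quoted in the text. The function name `adjust_to_2016` and the beginning-period CPI of 120.005 are made up for illustration and do not come from the downloaded CPI file.

```r
# Ending-period (2016) CPI as quoted in the text.
cpi_2016 <- 240.01

# Past dollars in 2016 dollars = amount * ending-period CPI / beginning-period CPI.
adjust_to_2016 <- function(amount, cpi_year, cpi_end = cpi_2016) {
  amount * cpi_end / cpi_year
}

# Hypothetical year whose CPI is exactly half of the 2016 value:
# USD 100 from that year corresponds to twice as much in 2016 dollars.
adjusted <- adjust_to_2016(100, cpi_year = 120.005)
```

Because the function is vectorised, it can be applied to a whole budget column together with the matching cpi column in one call.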

We also need to tidy the genres column, as it contains several values in one cell, and we need to split those values to be able to analyse this data. For that we create a new dataframe, split the genres values and create a new row for each genre value.
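One way to do this split in base R is sketched here on made-up rows. The name `genres_df` follows the text, but the approach is an assumption: the original chunk is not shown and may well have used tidyr instead.

```r
# Toy rows with pipe-separated genres, as in the genres column (titles are made up).
movies_toy <- data.frame(
  movie_title = c("A", "B"),
  genres      = c("Comedy|Drama|Romance", "Drama"),
  stringsAsFactors = FALSE
)

# Split each cell on the literal "|" character (fixed = TRUE avoids regex meaning).
genre_list <- strsplit(as.character(movies_toy$genres), "|", fixed = TRUE)

# Repeat each title once per genre and stack the genres into one column.
genres_df <- data.frame(
  movie_title = rep(movies_toy$movie_title, lengths(genre_list)),
  genre       = unlist(genre_list),
  stringsAsFactors = FALSE
)

# Number of movies per genre.
genre_counts <- table(genres_df$genre)
```

The `as.character()` call matters for the real data, where genres is stored as a factor and `strsplit()` would otherwise fail.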

It looks like we have now cleaned and prepared the data and are ready to dive into the analysis.

Univariate Plots Section

First, let’s take a closer look at the summary and distribution of our key variable, imdb_score:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.900   6.600   6.463   7.200   9.300

The distribution is a bit skewed to the left but otherwise looks close to normal: most data points lie within the 6.0 to 7.5 interval, the minimum value is 1.6, the maximum value is 9.3, and the mean and median are at the center of the distribution.

Let’s take a look at the distribution of movies by title_year:

## 
## 1927 1929 1933 1935 1936 1937 1939 1940 1946 1947 1948 1950 1952 1953 1954 
##    1    1    1    1    1    1    2    1    2    1    1    1    1    2    2 
## 1957 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 
##    1    1    1    1    2    3    5    5    1    1    2    3    4    3    2 
## 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 
##    5    6    3    2    7    7    6   13   17   14   13   20   15   25   30 
## 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 
##   29   32   26   30   33   44   51   66   92  100  114  156  157  174  185 
## 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 
##  145  174  177  185  147  180  176  163  166  151  157  137  116   55

Let’s consolidate the data points by cutting the title_year variable into decades using the R function cut and then plot the result.
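A minimal sketch of the decade bucketing with `cut` follows. The break points, the labels and the sample years are assumptions chosen for illustration; only the variable name `title_year.bucket` comes from the report.

```r
# A few sample release years.
years <- c(1927, 1959, 1960, 1999, 2000, 2016)

# Decade boundaries from 1920 to 2020 and one label per decade ("1920s", ...).
breaks <- seq(1920, 2020, by = 10)
labels <- paste0(head(breaks, -1), "s")

# right = FALSE makes intervals closed on the left, so 1960 falls into the 1960s.
title_year.bucket <- cut(years, breaks = breaks, labels = labels, right = FALSE)

# Number of movies per decade.
decade_counts <- table(title_year.bucket)
```

With the default `right = TRUE`, a year such as 1960 would be assigned to the 1950s bucket, so the interval direction is worth setting explicitly.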

The major part of the movies in the dataset were released between the 1990s and the 2010s, with the peak in the 2000s, whereas older movies are not well represented in the dataset.

Let’s explore our genres now. We use our genres_df dataframe in order to visualize the distribution of movies by genres.

The top 5 genres of the movies in our dataset are: Drama, Comedy, Thriller, Action and Romance.

The 5 least represented movie genres are: Film Noir, Documentary, Western, Musical, Sport.

Now let’s take a closer look at movie budgets. In order to draw appropriate conclusions we consider the standardized variables.

Let’s check out summary first

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       284  13690000  32010000  48030000  66890000 347300000

Our values are quite skewed: the minimum budget is only USD 284, the middle 50% of the data lies between USD 13.69 mln and USD 66.89 mln, and the maximum budget is an enormous USD 347.3 mln. Therefore we need to transform our data to see the distribution of values more clearly. We do that using a log10 transformation. For comparison’s sake, we put the non-transformed and transformed data together.
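To illustrate why a log10 transformation helps here, the sketch below applies it to the five quantiles from the summary above. The variable names are made up, and this is not the plotting code of the report.

```r
# Minimum, quartiles and maximum of the standardized budget, from the summary above.
budgets <- c(284, 13690000, 32010000, 66890000, 347300000)

# On the raw scale the largest value is more than a million times the smallest.
raw_ratio <- max(budgets) / min(budgets)

# On the log10 scale the same values span only a few units.
log_budgets <- log10(budgets)
log_span    <- diff(range(log_budgets))

round(log_budgets, 2)
```

Each unit on the log10 scale is one order of magnitude, so a histogram on that scale gives the small and the large budgets comparable room.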

The transformed data is a bit skewed to the left. Movie budgets look quite diverse, mostly ranging from tens of millions to hundreds of millions of dollars.

Let’s compare the average IMDb score for the top 5 highest-budget and top 5 lowest-budget movies:

## [1] "Overall mean IMDB score: 6.46"
## [1] "Mean IMDB score for higherst budget movies: 7.15"
## [1] "Mean IMDB score for lowest budget movies: 7.02"

Since the distribution of our IMDb scores is almost normal, let’s use the Central Limit Theorem to estimate how far apart the above mean scores are. We calculate the standard deviation first:

## [1] 0.1261012

It is curious how close the average scores for the highest-budget and lowest-budget movies are: they are only about 0.13 standard deviations apart! It is also curious that the lowest-budget movies have an IMDb score higher than average.
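One plausible way to express the gap between two group means in standard deviation units is sketched below. This is an assumption about the kind of calculation behind the figure above, since the original chunk is not shown, and all numbers are made up.

```r
# Made-up IMDb scores for the whole (toy) dataset.
scores <- c(5, 6, 7, 8, 9)

# Made-up scores of the highest-budget and lowest-budget groups.
high_budget <- c(7, 8)
low_budget  <- c(7, 7)

# Difference between the group means, divided by the overall standard deviation.
gap_in_sd <- abs(mean(high_budget) - mean(low_budget)) / sd(scores)
```

A value well below 1 means the two groups differ by much less than the typical spread of scores across all movies.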

It looks like budget size and imdb_score are not related to each other, which goes against my intuition: I would definitely favour high-budget movies over low-budget ones for better graphics, soundtrack, cast etc. So we will need to look into the budget variable in tandem with other variables during the bivariate analysis.

Now let’s explore movies gross distribution

Since the distribution of the gross variable is heavily skewed to the right, we performed a logarithmic transformation to visualize the data. The transformed distribution, just like budget, is slightly skewed to the left but otherwise looks normal.

Nevertheless, we need to remember that the gross values in this dataset are box office revenues earned in the US only, and a lot of movies with a low US gross may have grossed much more in their country of origin (e.g. the South Korean “The Host” grossed only USD 314,488 in the US, while in South Korea its box office was KRW 10,002,411,650, which is USD 8,748,609).

To elaborate on movies profitability, we can also calculate movies profit and then visualize profit distribution for all movies.

It is amazing how many movies have a profit just above zero. It looks like a large share of the movies earn just enough to cover their budget expenses.

However, as we observed above, gross numbers are given for the US only, so let’s consider only US movies in our profitability analysis.

As we can see, the general profitability trend hasn’t changed dramatically. The center of the distribution is very close to 0, which means that around half of the American movies in this dataset at best grossed just enough to cover their budget expenses.

Now let’s convert the absolute profit numbers into a ratio and calculate the return on investment (ROI) percentage for each movie (the gain or loss generated by an investment relative to the amount of money invested).

We use the following formula: roi = profit / budget *100

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   -100.0    -52.5     10.3    535.5    123.8 719300.0

The maximum ROI value of 719 300% looks quite suspicious, let’s investigate it

##               movie_title budget_2016 gross_2016
## 3588 Paranormal Activity     17363.51  124921516

OK, “Paranormal Activity” is an ultra-low-budget horror movie that made an enormous box office when it hit the screens; it has also been described as the most profitable movie of all time.
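The ROI formula can be checked by hand against the figures printed above. This sketch assumes that profit is gross minus budget, which the report implies but does not spell out; the variable names mirror the columns shown in the output.

```r
# Inflation-adjusted figures for "Paranormal Activity" from the output above.
budget_2016 <- 17363.51
gross_2016  <- 124921516

# Assumption: profit is the gross minus the budget.
profit <- gross_2016 - budget_2016

# roi = profit / budget * 100, as in the formula used in the report.
roi <- profit / budget_2016 * 100

# A movie that grossed nothing loses its whole budget: ROI of -100%.
roi_zero_gross <- (0 - budget_2016) / budget_2016 * 100
```

The result is consistent with the maximum of roughly 719 300% in the summary, and the second calculation shows why -100 is the lower bound of the ROI distribution.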

But let’s exclude such anomalies from our analysis and analyse the majority of the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -100.00  -54.50    4.53   61.12  104.10  977.70

We see a curious spike just around -100, which means that these movies grossed close to nothing in the US and lost almost their entire budget. The distribution is heavily skewed to the right and the typical (median) ROI percentage is just about 5%.

Now let’s explore the duration of the movies: we calculate summary and plot the distribution first.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    37.0    96.0   106.0   110.2   120.0   330.0

The distribution peaks at about 100 minutes, which makes sense since the majority of movies run for about an hour and a half to two hours. However, there are also some short movies, with a duration under 40 minutes, as well as movies more than 5 hours long.

Now let’s look into the number of voted users variable:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      91   19140   53030  104700  126600 1690000

The distribution is heavily skewed to the right, so we apply a logarithmic transformation to see it more clearly.

The transformed data is distributed almost normally, with the number of voted users peaking at about 175 000 and the biggest part of the data concentrated in the range between 120 000 and 1 100 000.

The last two variables we included in our analysis are the number of users who reviewed the movie on the IMDb website (num_user_for_reviews) and the number of reviews of the movie on external websites (num_critic_for_reviews). For the purpose of this analysis it makes more sense to explore these two variables together, and therefore we sum them up into a new variable: total_movie_reviews.
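The new variable is a plain row-wise sum, sketched here on a toy data frame. The first two rows reuse values from the structure output above; the third row, with a missing value, is made up to show how NAs behave.

```r
# Review counts for three movies (third row has a made-up missing value).
movies_toy <- data.frame(
  num_user_for_reviews   = c(3054, 1238, NA),
  num_critic_for_reviews = c(723, 302, 602)
)

# Row-wise sum of both review counts; an NA in either column gives an NA total.
movies_toy$total_movie_reviews <-
  movies_toy$num_user_for_reviews + movies_toy$num_critic_for_reviews
```

Since the report removes rows with NAs before the analysis, the propagation of NA in the sum does not affect the final dataset, but it is worth keeping in mind when the sum is computed on the raw data.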

The distribution has a tail to the right and peaks at about 250 reviews. Most of the movies have 100 - 400 reviews in total, however there are still some movies with more than 1000 reviews.

Do the movies with big number of votes have also big number of user reviews?

Now we explore summary and distribution of movie facebook likes

Our data is skewed and also has some curious gaps at regular intervals. I suspect the facebook likes data in our dataset is incomplete and the values are not missing at random: there might be some bugs in the data collection code.

The number of facebook likes for the majority of movies is below 1000, however for some movies the number of likes rises up to 50 000.

Univariate Analysis

What is the structure of your dataset?

The original dataset contains 28 variables and 5043 observations. For the purpose of this analysis the number of observations was reduced by removing duplicate rows and NAs, we also created some new supporting variables, so the final dataset has 3655 observations across 38 variables.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is IMDb score: we aim to identify the variables that contribute to a movie’s IMDb score.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The features that support our exploration of IMDb score could be divided into three main groups :

  • movie profitability characteristics (gross, profit and ROI);

  • movie popularity characteristics (number of voted users, movie total reviews);

  • movie production characteristics (duration, genre and title year).

Did you create any new variables from existing variables in the dataset?

To measure the profitability of the movies, we created two new variables: profit (the money earned by a movie in the US net of its budget, expressed as an absolute number) and ROI (the profit relative to the amount invested, expressed as a percentage).

We also created several complementary variables while converting movie budgets to USD and then standardizing the numbers to account for inflation over the years. This allowed us to analyse and compare the budgets and gross numbers of movies from different countries, released at different times.

On top of that, we combined two movie popularity variables, the number of movie reviews on the IMDb website and the number of reviews on other websites, into one variable with the total number of reviews for a movie.

We also consolidated movie title years by cutting them into decades and creating a new variable, title_year.bucket, for it.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

We started off by tidying our financial variables.

We also made some adjustments to our genres variable by unwinding it into a separate dataframe.

The biggest surprise so far was the profit levels of the movies, which were unexpectedly low: almost half of the movies had profit levels below 0. Even after excluding foreign movies from the profitability analysis (since the profit numbers reflect movie revenues only in the US, we assumed that many movies with low profits in the US might have done much better in their country of origin), the median profit value was close to 0 and the median ROI percentage only about 5%.

We also found an interesting pattern in the distribution of the movie_facebook_likes variable: data points on the histogram are missing at regular intervals, which led us to assume that this data might be incomplete, with data points not missing at random, possibly because of mistakes in the data collection process.

Bivariate Plots Section

Let’s plot the key variables using the ggpairs function. We also include our new variables: ROI, profit and movie total reviews.

As per the correlation matrix, IMDb score has the highest correlation coefficients with the number of voted users and the total number of movie reviews. There is also some considerable correlation between IMDb score and both movie duration and movie gross.

Let’s consider these pairs one at a time

IMDb score vs. Number of voted users

After reducing overplotting by adjusting the alpha parameter, we can see that low-ranked movies tend to have the lowest numbers of user votes, and that the movie rank tends to increase with the number of votes.

We might think that the IMDb ranking of a particular movie is FORMED by the number of user votes for it, but that is not so. The IMDb ranking system is based on a weighted average, so a movie that many users voted for, most of whom rated it 6 out of 10, will rank lower than a movie with fewer votes, most of which were 10 out of 10.
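The point that the number of votes does not determine the score can be illustrated with two made-up movies. The sketch uses a plain arithmetic mean of the ratings as a stand-in; it is not IMDb's actual weighting, which is not described in this report.

```r
# Movie A: 1000 votes, most of them 6 out of 10 (made-up numbers).
votes_a <- c(rep(6, 900), rep(10, 100))

# Movie B: only 100 votes, most of them 10 out of 10 (made-up numbers).
votes_b <- c(rep(10, 90), rep(6, 10))

# Average rating of each movie.
score_a <- mean(votes_a)
score_b <- mean(votes_b)
```

Movie A has ten times as many votes as movie B but the lower average rating, which is why vote count and score have to be treated as separate variables.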

IMDb score vs. Movie Total Reviews

Again, we see that the IMDb score tends to increase with the total number of reviews: low-ranked movies (IMDb score < 5) have up to 750 reviews, and higher-ranked movies up to 1700.

However, let’s explore whether total reviews and the number of voted users are related to each other.

We can notice a considerable correlation between the variables, which makes sense: the more people vote for a certain movie, the more likely it is that some of them will also leave a review. However, the relationship is clear mainly for movies with low numbers of votes and reviews, whereas for movies with higher numbers of votes the number of reviews is quite dispersed.

IMDb score vs. Movie Facebook Likes

We are also interested to investigate the relationship between movie IMDb score and movie facebook likes.

Low-score movies tend to have a very low number of facebook likes, and the number increases for movies with an IMDb score higher than 5.5. Beyond this mark, however, we lose the pattern: all movies with IMDb scores > 5.5 seem to have similar chances of getting a high number of facebook likes.

IMDb score across genres

Plotting IMDb score by genre gives quite a stable picture: a mostly normal distribution for each genre that peaks at the overall mean. So we cannot say that some particular genres have better rankings on IMDb than others.

Now let’s see if we can reveal any relationship between movie duration and movie IMDb score

As per our plot, most of the movies are between 80 and 140 minutes long: the duration of low-ranked movies is within 80-120 minutes, while for movies with imdb_score above 5 the duration is much more dispersed and falls in the range 100 - 160. The general pattern is quite clear: as duration increases, the IMDb score tends to increase too.

This is also consistent with the fact that many epic Oscar winners ran for about two hours or more: ‘The Godfather’ back in the 70s, ‘Titanic’ and ‘Braveheart’ in the 90s, ‘Slumdog Millionaire’ and ‘The Lord of The Rings’ in the 00s, and ‘The King’s Speech’ and ‘Argo’ in the 10s.

Let’s investigate the average movie duration over the years. We expect to see an increase in mean movie duration in recent decades because of the switch from analogue to digital film making (shooting a 3-hour movie on 35 mm film is far more costly than on a hard drive): by saving on production, directors can spend more on elaborating the story line, which would result in longer movies.

To do that we create a new data frame with our variables grouped by title year and summarized with the mean and median values for each group.
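The grouping step can be sketched in base R as follows. The toy data and the names `movies_toy` and `duration_by_year` are assumptions; the original chunk, which is not shown, may have used dplyr instead.

```r
# Toy data: release year and duration in minutes (made-up values).
movies_toy <- data.frame(
  title_year = c(1990, 1990, 1990, 2010, 2010),
  duration   = c(120, 100, 140, 90, 110)
)

# Mean and median duration for each title year.
mean_duration   <- tapply(movies_toy$duration, movies_toy$title_year, mean)
median_duration <- tapply(movies_toy$duration, movies_toy$title_year, median)

# Collect the per-year summaries into one data frame, one row per year.
duration_by_year <- data.frame(
  title_year      = as.integer(names(mean_duration)),
  mean_duration   = as.vector(mean_duration),
  median_duration = as.vector(median_duration)
)
```

The resulting data frame has one row per year and can be passed straight to a line plot of mean and median duration over time.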

It looks like our assumption is incorrect: as per the plot, the average movie duration has actually decreased over time. Since we have very few old movies (before the 1960s), we see substantial fluctuations in the mean duration graph for those years; as the number of movies per year increases, the graph gets smoother. The general trend is still quite visible: mean duration is decreasing in later years.

Let’s check out budget and duration relationship

We can see curious vertical lines on the plot. This is probably because a movie budget is very often approximated as a ‘good-looking’ number (10 million dollars, 20 million dollars etc.). Besides, we can see that there is no evident relationship between budget and duration: in fact the longest movies tend to be low or medium budget.

Now let’s explore the relationships of IMDb score with the financial variables (gross, profit and ROI).

IMDb score vs. Gross

Let’s summarise our plot by computing the mean and median gross for each IMDb score and add them as a layer to our previous plot; we also adjust the alpha parameter to reduce overplotting.

This will help us to see the general trend in gross figures as imdb score increases.

There are some occasional spikes in mean gross, however the general trend is still visible: mean gross increases with the movie IMDb score.

IMDb score vs. Profit

The profits of the low-ranking movies (imdb_score < 5) seem to be mainly below 50 million dollars, whereas the profits of high-ranking movies are very diverse and can reach up to 250 million.

However, a considerable share of movies have a good IMDb score with very moderate profits (25 million dollars and below). In fact, “The Shawshank Redemption”, the highest-ranked movie on IMDb, earned only a modest 5.5 million dollar profit!

Let’s examine the relationship between profit and IMDb score across different genres.

Splitting profits by genres did not reveal any new patterns in our data.

However, we also want to see how mean and median profits change as IMDb score increases, so let’s summarise our profit numbers for each IMDb score and add this information to our plot.

Sure enough, average profits for low-ranking movies tend to be low, sometimes even negative, and mean profit increases with movie ranking.

Now we want to see if there is any relationship between profits and ROI ratio. Intuitively, we can assume that higher profits bring higher ROI.

Looks like there is hardly any visible relationship at all: there is a bunch of highly profitable movies with a low ROI percentage, and there are a lot of movies with both high profits and a high ROI percentage. This is surprising at first, since profit is taken into account when calculating the ROI ratio; however, ROI also depends on the size of the budget.
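One possible explanation can be sketched with a small worked example. It assumes that profit is computed as gross minus budget and ROI as profit divided by budget, expressed in percent; these definitions and the numbers are our assumptions for illustration.

```r
# Two hypothetical movies with the same profit but very different budgets
budget <- c(200e6, 5e6)
gross  <- c(300e6, 105e6)

profit <- gross - budget          # assumed definition of profit
roi    <- profit / budget * 100   # assumed definition of ROI, in percent

print(data.frame(budget, gross, profit, roi))
```

Under these assumptions both movies earn the same profit, yet the low-budget one has a far higher ROI, so high profit alone does not imply high ROI.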

Now we want to see if ROI percentage is connected with IMDb score.

In fact, ROI data points are distributed across IMDb scores very similarly to the profit variable.

Now we want to see how ROI changes across different IMDb scores. To achieve that we apply the same technique as above: summarise ROI numbers for each IMDb score and plot them over the imdb_score vs. ROI scatterplot.

The summary graphs are very noisy, but we can still observe a similar trend as above: low-ranking movies tend to have a lower average ROI percentage than movies with a higher IMDb ranking.

Lastly, we want to investigate if there is any relationship between IMDb score and movie genre. Do movies of certain genres score higher than others?

IMDb score vs. Genre

The median IMDb score for almost all genres is within the 6 - 7 range; however, the median for the Film-Noir and News genres rises considerably above the general trend, and the median for Game Show and Reality TV falls below it.

It is curious that there are a lot of outliers below Q1 of the IMDb score for each genre and very few outliers above Q3. This means that, genre-wise, the distribution of IMDb scores is skewed towards lower scores. We can also notice that low-ranking movies are present in almost all genres.

Let’s zoom in on the upper part of our boxplot to see which genres tend to have a higher IMDb score than others.

Crime movies rank highest, with Action, Drama and Thriller movies close behind.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

As per our analysis, IMDb score is closely related to movie popularity parameters (high ranking movies tend to have higher number of votes and reviews from users) and profitability parameters (movies with higher IMDb score tend to have higher average profitability and ROI percentage), whereas the relationship between movie IMDb ranking and movie production characteristics (title year, duration, genre) seems to be quite weak.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The most interesting relationship was between mean movie duration and title year. With the switch from analogue to digital movie making, the costs of movie production were reduced significantly (buying and handling lots of 35 mm film reels is far more costly than spending a couple of hundred dollars on a hard drive), so logically directors had more budget to elaborate on the story line. We therefore expected longer movie durations in the period from the 1990s to the 2010s. However, having plotted the mean movie duration across years, we found that it actually decreased compared to earlier periods.

What was the strongest relationship you found?

We should admit that the data in this dataset is quite diverse: there are relationships that hold for some movies but not for others. For example, we found an interesting relationship between movie IMDb score and number of facebook likes: low-score movies tend to have a very low number of facebook likes, and the number increases dramatically for movies with an IMDb score higher than 5.5. Past this mark, however, we lose the pattern: all movies with IMDb scores of 5.5 and up have equal chances of getting a high number of facebook likes.

However, the strongest general trends that hold across all the data are the relationships with financial characteristics: we noticed a positive correlation between mean gross and IMDb score and between mean profit/ROI and IMDb score.

There is also a strong positive correlation between movie IMDb score and the number of voted users.

Multivariate Plots Section

In this section we want to investigate more deeply the relationships we found in the bivariate analysis.

Firstly, let’s look at movie budgets.

IMDb score vs. Budget vs. Genres

At first glance this looks like a dead end: there is no visible trend in IMDb score vs. budget for any particular genre.

This goes against our intuition, since a movie’s budget DOES influence its quality (a high budget allows better graphics, cast and soundtrack, which increase the movie quality and consequently should increase its IMDb score - I would more often prefer a high budget movie over a low budget one). So let’s look at this data from another angle: we facet our general scatter plot by genre.

Faceting gives much more clarity, and we see that all movie genres can be divided into two big groups. In the first group, mean budget does not change with IMDb score and often falls below the average budget across all movies (Romance, Sport, Mystery, Documentary, Horror). In the second group, the average budget increases with IMDb score (Action, Adventure, Sci-Fi, Family, Animation).

So, just as we anticipated, budget does affect movie IMDb score, but only in genres where a budget increase influences the quality of the movie (e.g. increasing the budget for a SciFi movie allows better graphics, which contributes to movie quality: people won’t appreciate a SciFi movie with cheap graphics; at the same time this won’t work for Documentary or Romance, since an increase in budget won’t contribute to movie quality that much).
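A rough sketch of this genre-wise view is given below. It assumes a long-format data frame with one row per movie-genre pair, here called `movies_by_genre` (a hypothetical name), and uses made-up numbers chosen only to mimic the two groups.

```r
library(dplyr)
library(ggplot2)

# Toy long-format data: one row per movie-genre pair (illustrative values)
movies_by_genre <- data.frame(
  genre      = rep(c("Sci-Fi", "Romance"), each = 4),
  imdb_score = c(5, 6, 7, 8, 5, 6, 7, 8),
  budget     = c(40e6, 80e6, 120e6, 160e6, 20e6, 22e6, 19e6, 21e6)
)

# Mean budget per genre and IMDb score
budget_by_genre_score <- movies_by_genre %>%
  group_by(genre, imdb_score) %>%
  summarise(budget_mean = mean(budget, na.rm = TRUE), .groups = "drop")

# Reference line: average budget over all rows of the toy data
overall_mean_budget <- mean(movies_by_genre$budget, na.rm = TRUE)

# One panel per genre, with the overall average as a dashed line
p <- ggplot(budget_by_genre_score, aes(x = imdb_score, y = budget_mean)) +
  geom_line() +
  geom_hline(yintercept = overall_mean_budget, linetype = "dashed") +
  facet_wrap(~ genre)
```

The dashed reference line makes it easy to spot genres whose mean budget stays below the overall average.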

Let’s continue our genre-wise investigation and explore the relationship between mean duration and IMDb score for each genre.

Here we see a clear pattern across all genres: IMDb score increases with movie duration.

Based on what we’ve investigated so far, although the general trend was not visible in the bivariate analysis, dissecting it genre-wise really helped us to reveal it.

IMDb score vs. profit vs. budget

Here the pattern is not visible either: some movies have a high budget but low profit and a low IMDb score, and there are also low-budget movies that earned a high IMDb score and made a good profit. However, we should admit that the higher a movie ranks, the better its chance of returning a high profit.

IMDb score vs. budget vs. number of voted users

This plot confirms that a high budget does not guarantee movie popularity or a high IMDb score: only high-score movies receive a large number of votes, regardless of their budget.

Again, we must keep in mind that IMDb score is not determined by the number of votes, because it is based on a weighted mean rather than an arithmetic mean (so a movie with fewer votes but high scores, e.g. 10 votes in total with score 9, will have a higher IMDb rank than a movie with more votes but lower scores, e.g. 100 votes in total with score 7). The pattern therefore confirms that people tend to be willing to vote for movies with higher IMDb scores, which are not necessarily blockbusters.

IMDb score vs. number of voted users vs. number of movie reviews

The number of movie reviews is increasing together with the number of votes from users and IMDb score.

IMDb score vs. number of voted users vs. title_year.bucket

Here we also see a very interesting pattern: IMDb users tend to vote more for recent movies (released in the 1990s - 2010s) regardless of their IMDb score, while old movies, despite having quite high IMDb scores, tend to receive fewer user votes. We notice mostly pink and purple dots representing recent movies in the upper part of the plot, and blue, green and orange dots representing older movies at the bottom.
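A bucketed year variable like `title_year.bucket` can be built with `cut()`. The sketch below is only one way to do it: the break points, labels and sample years are assumptions, not the exact buckets used for the plot.

```r
# Illustrative release years (not taken from the dataset)
title_year <- c(1939, 1972, 1994, 2003, 2016)

# Left-closed intervals such as [1990, 2000); breaks and labels are assumptions
title_year.bucket <- cut(
  title_year,
  breaks = c(1900, 1960, 1980, 1990, 2000, 2010, 2020),
  labels = c("before 1960", "1960-1979", "1980s", "1990s", "2000s", "2010s"),
  right  = FALSE
)

print(table(title_year.bucket))
```

Mapping the resulting factor to the colour aesthetic gives one colour per period on the scatterplot.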

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

We investigated two types of movie characteristics from different angles and in combination with each other to determine if they affect movie IMDb scores.

There are a few interesting relationships:

  • people tend to be willing to vote for movies with higher IMDb scores, which are not necessarily blockbusters, i.e. a high budget does not guarantee popularity;

  • users tend to vote more for recent movies (released in the 1990s - 2010s) regardless of their IMDb score, while old movies, despite having quite high IMDb scores, tend to receive fewer user votes;

  • the number of movie reviews and the number of user votes increase in tandem, and IMDb score increases with them.

Were there any interesting or surprising interactions between features?

One of the big surprises was that at first glance we did not see any relationship between movie IMDb score, genre and budget. This looked a bit weird to me, because I would favour a high budget movie over a low budget one (mainly because more budget allows better graphics, soundtrack, famous actors and so on). I think this is quite usual behavior among movie-lovers, so I was confused why my plots did not reflect it.

I decided to analyse IMDb score and budget for each genre separately and discovered that the trend I was expecting is there, but not for all genres. In genres that are budget-prone (i.e. where the budget can improve the quality of a movie, e.g. a higher budget allows better graphics in graphics-heavy genres like SciFi, Action, Adventure etc., and therefore better overall quality), IMDb score increases with the budget. However, in genres that are not very budget-prone (Romance, Documentary) this trend is not present.

Likewise, I analysed the relationship between IMDb score and movie duration across genres and discovered a pattern that was not visible in the bivariate analysis: as movie duration increases, movie IMDb score increases as well.


Final Plots and Summary

Plot One

Description One

The distribution of movie IMDb scores is slightly skewed to the left but otherwise looks normal, peaking at 6.5 - 7.0 with a minimum value of 1.5 and a maximum value of 9.3. The mean and median of the distribution are very close (about 0.13 standard deviations apart).
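The distance between mean and median in standard deviation units can be computed as in the sketch below. The vector holds made-up scores standing in for the `imdb_score` column, so the result will not match the figure quoted above.

```r
# Toy scores standing in for the imdb_score column (illustrative only)
imdb_score <- c(1.5, 4.8, 5.9, 6.3, 6.6, 6.8, 7.0, 7.2, 7.6, 9.3)

# Gap between mean and median in units of standard deviation;
# a negative value means the mean sits below the median (left skew)
gap_in_sd <- (mean(imdb_score) - median(imdb_score)) / sd(imdb_score)
print(gap_in_sd)
```

A small negative gap is what we would expect for a distribution with a longer tail towards low scores.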

Plot Two

Description Two

One of the strongest factors that influence movie IMDb score is the number of user votes on the IMDb website. However, the relationship is not linear, since the IMDb ranking system is based on a weighted average in which scores carry different weights (i.e. a movie with 10 votes all scoring it 9 has a higher IMDb score than a movie with 100 votes scoring it 6).

Plot Three

Description Three

Though the relationship between budget and IMDb score is not clear when analysing all movies together, it becomes clear when analysed for each genre separately.

In genres that are budget-prone (i.e. where the budget can improve the quality of a movie, e.g. a higher budget allows better graphics in graphics-heavy genres like SciFi, Action, Adventure etc., and therefore better overall quality), IMDb score increases with the budget. However, in genres that are not very budget-prone (Romance, Documentary) this trend is not present.


Reflection

The original movies dataset contains information about 5043 movies across 28 variables.

Since the dataset was scraped from the IMDb website, the raw data had to be cleaned before starting the analysis. After removing duplicate rows and NAs and creating some new variables, the final dataset had 3655 observations across 38 variables.

At the data cleaning phase we also had to tidy the movie budget and gross variables because:

  1. For some movies released outside the US, the budget was expressed in local currency, so we had to modify the data collection source code in order to see the budget currency for each movie (originally, this information was dropped while parsing the data scraped from the IMDb website). Then we had to convert all the currencies to USD, and to do that we scraped the exchange rates for the period 1953 - 2016 for all our currencies.

  2. To be able to analyse and compare budget and gross for movies released in different years, we had to account for inflation and standardize these variables to a single base year (we chose 2016). To achieve that, we scraped the US Consumer Price Index for the period 1900 - 2016 and adjusted our budget and gross variables.
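The two adjustments above can be sketched as follows. The lookup tables, column names and all numbers are hypothetical placeholders (not real exchange rates or CPI values), and the sketch uses a single rate per currency for brevity, whereas the rates described above vary over the period.

```r
# Hypothetical lookup tables; the numbers are placeholders, not real data
fx_to_usd <- c(USD = 1.0, GBP = 1.5)         # USD per unit of local currency
cpi       <- c("1990" = 100, "2016" = 200)   # price index by year

# Toy movies with budgets in local currency
movies <- data.frame(
  title_year = c(1990, 2016),
  currency   = c("GBP", "USD"),
  budget     = c(10e6, 50e6),
  stringsAsFactors = FALSE
)

# Step 1: convert budgets to USD using the currency lookup
movies$budget_usd <- movies$budget *
  unname(fx_to_usd[as.character(movies$currency)])

# Step 2: rescale to 2016 dollars with the ratio of price indices
movies$budget_2016 <- movies$budget_usd *
  unname(cpi["2016"]) / unname(cpi[as.character(movies$title_year)])

print(movies)
```

The same two steps would apply to the gross variable.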

The next challenge during the data wrangling phase was the genres variable: it contained several values for each row, so in order to perform the genre-wise analysis we had to create a new data frame and ‘unwind’ the genres for each movie.
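One way to ‘unwind’ the genres with base R is sketched below. It relies on the `|` separator seen in the genres column; the `movie_title` column, the toy rows and the result name `movies_by_genre` are assumptions for illustration.

```r
# Toy rows using the same "|"-separated format as the genres column
movies <- data.frame(
  movie_title = c("Movie A", "Movie B"),
  genres      = c("Action|Adventure|Sci-Fi", "Drama"),
  stringsAsFactors = FALSE
)

# Split each genres string on the literal "|" character
genre_list <- strsplit(movies$genres, "|", fixed = TRUE)

# Repeat each movie's row once per genre it belongs to
row_index <- rep(seq_len(nrow(movies)), lengths(genre_list))
movies_by_genre <- movies[row_index, "movie_title", drop = FALSE]
movies_by_genre$genre <- unlist(genre_list)
rownames(movies_by_genre) <- NULL

print(movies_by_genre)
```

Because a movie now appears once per genre, counts and averages computed on this data frame should be read per genre, not per movie.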

Following the data correlation matrix and our intuition, we selected 9 variables that might influence movie IMDb score. They can be divided into three groups: movie production characteristics (title year, duration, genre), popularity characteristics (number of voted users, number of user reviews, movie facebook likes) and financial characteristics (budget and gross).

The most evident factors that influence IMDb score are movie popularity characteristics (number of voted users and number of user reviews), while others require more analysis to reveal the pattern: the relationship between IMDb score and movie financial characteristics (profit and ROI) is not visible until you summarise them into mean and median values. Only then does the pattern become visible: high-score movies tend to have higher average profits and ROIs.

I think that the biggest success of this analysis is revealing the underlying relationship between movie budget and IMDb score: intuitively, a positive correlation between these variables makes sense, but the actual correlation coefficient and bivariate analysis did not reflect it.

The trend was uncovered while performing the IMDb score vs. movie budget analysis for each movie genre separately. For genres that are budget-prone (i.e. where the budget can improve the quality of a movie, e.g. a higher budget allows better graphics in graphics-heavy genres like SciFi, Action, Adventure etc., and therefore better overall quality), a budget increase contributes to movie IMDb score, while for genres that are less budget-prone (Romance, Documentary) this relationship disappears.

For the future, it would be very exciting to use all our findings to build a model and predict movie IMDb score. Since the data is quite diverse and there are no linear relationships between variables, a linear model is likely to perform poorly, so we would possibly need to resort to more complicated algorithms.